In [1]:
using GLM
using Statistics
using DataFrames
using Plots
using Random
plotly()
Out[1]:
Linear regression is a type of statistical modeling. Modeling, as the name suggests, creates a model from data. The model can then take new data and use it to make predictions.
In linear regression, that prediction is a continuous numerical variable, termed the target variable or the dependent variable. One or more feature variables, also called independent variables, are used in the prediction. A model learns from the feature variable(s) to predict the target variable.
In the code below, we create two arrays. The first represents a feature variable and the second a target variable.
In [2]:
Random.seed!(1) # For reproducible results
X = rand(Float16, 50) # 50 random feature values in [0, 1)
y = X .* 10 .+ (2 * randn(50)); # Target: 10 times the feature plus Gaussian noise
In [3]:
plot(X, y, seriestype = :scatter, title = "Scatter plot", label = "data")
Out[3]:
One possible model is shown below (red line). The model is linear (a straight line). The equation for a straight line (from school algebra) is $y = m x + c$. Here, $m$ is the slope and $c$ is the $y$ intercept (the value of $y$ where $x = 0$). In statistics we typically use the symbols $m = \beta_1$ and $c = \beta_0$.
In [4]:
x_vals = collect(0:0.01:1) # x values from 0 to 1 in steps of 0.01
plot!(x_vals, 10 * x_vals, label = "model")
Out[4]:
The feature variable value is multiplied by $10$ ($m = \beta_{1} = 10$) and we add no intercept ($\beta_{0} = 0$). Since this gives us a prediction of the target variable and not the actual target variable, we use the notation $\hat{y}$, as in (1).

$$\hat{y} = \beta_{0} + \beta_{1} x \qquad (1)$$
Let's view the first $X$ variable value, $X_1$.
In [5]:
X[1]
Out[5]:
It is $0.1699$. Its target value is found in y[1].
In [6]:
y[1]
Out[6]:
Our model predicts $\hat{y}_{1} = 10 X_{1} = 10 \times 0.1699 = 1.699$. This is a bit off. The difference between a predicted target value and the actual target value is called the error or the residual, denoted by the symbol $\epsilon$. We calculate the residual for the first entry in the data to be ${\epsilon}_{1} = 2.617 - 1.699 \approx 0.918$. To get the actual value back, we must add an error value to each predicted value, as in (2).

$$y = \hat{y} + \epsilon \qquad (2)$$
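As a quick sketch, we can check this first residual directly in code; it should match the rounded values above.
In [ ]:
# Residual for the first observation: actual target minus predicted value
res_1 = y[1] - 10 * X[1]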
Below, we calculate the predicted values for all of our data and then the residuals.
In [7]:
y_pred = 10 * X # Predicted target values from our model
res = y - y_pred; # Residuals: actual minus predicted
We can create a histogram of all the residuals.
In [8]:
histogram(res, bins = 10, title = "Residuals", label = "residuals")
Out[8]:
Another way to visualize the residuals is to look at them as a scatter plot. Each residual is above or below the $y = 0$ line.
In [58]:
plot(collect(1:length(y)), res, seriestype = :scatter, title = "Residuals", label = "residuals")
Out[58]:
This model can now be used on new data. Given any feature variable value, the model will predict a target variable value. This is the red line in the model plot above.
To see how good our model is, we create a loss function.
A loss function determines the difference between a set of predicted values and the actual values for the target variable. In linear regression, we typically use the mean squared error (MSE), (3).

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_{i} - y_{i} \right)^{2} \qquad (3)$$
Writing out the expansion and the calculation for $\hat{y}$, we have (4).

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( \beta_{0} + \beta_{1} x_{i} - y_{i} \right)^{2} \qquad (4)$$

Here $n$ is the sample size. We subtract each actual target value from its predicted value and square the difference. This ensures a positive error term for each sample. We add all these squared errors and divide by the sample size, giving us a mean squared error.
Let's look at the first three samples.
In [10]:
y_pred[1:3]
Out[10]:
In [11]:
y[1:3]
Out[11]:
In [12]:
X[1:3]
Out[12]:
The squared differences for these three samples, and the abbreviated calculation of their mean, are sketched in the code below.
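This is a quick sketch using the arrays defined above; the exact numbers depend on the random seed set earlier.
In [ ]:
# Squared differences for the first three samples
sq_diff = (y_pred[1:3] .- y[1:3]).^2
# Their mean: the MSE restricted to these three samples
sum(sq_diff) / 3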
We can create a function to do the calculation of the MSE for us.
In [13]:
function mse(hat_y, y)
    return sum((hat_y - y).^2) / length(y)
end
Out[13]:
In [27]:
mse_model = mse(y_pred, y)
mse_model
Out[27]:
We see an MSE of $3.97$.
Perhaps we can do a bit better. To explain how, we start with a deliberately weak baseline. This base model simply uses the mean of the target variable as the predicted value for every sample.
We create a new array of predicted values, based on the mean of the target variable.
In [15]:
mean(y)
Out[15]:
In [16]:
y_pred_mean = repeat([mean(y)], length(y));
Let's look at the MSE of this model.
In [26]:
mse_mean_model = mse(y_pred_mean, y)
mse_mean_model
Out[26]:
We can compare a model to the base model, as in (6).

$$\text{comparison} = \frac{\text{MSE}_{\text{base}} - \text{MSE}_{\text{model}}}{\text{MSE}_{\text{base}}} \qquad (6)$$
When we multiply the result by $100$ we express the comparison as a percentage. For any model that fits at least as well as the base model, this ratio lies between $0$% and $100$%. We can say that, according to the model, the feature variable explains the comparison percentage of the variance in the target variable. Alternatively, we can state that the model reduces the variance in the base model's prediction by the comparison percentage.
In our case the comparison is as shown below.
In [28]:
(mse_mean_model - mse_model) / mse_mean_model
Out[28]:
We can say that our model with $\beta_{0} = 0$ and $\beta_{1} = 10$ explains $67.7$% of the variance in the target variable.
Rather than guessing $\beta_{0}$ and $\beta_{1}$, we can learn them with gradient descent. Below, we first define a cost function: half the MSE, written in matrix form with the coefficients collected in a vector $\theta$ (the factor of a half simplifies the gradient).
In [30]:
function cost_function(X, y, θ)
    n = size(X)[1] # Sample size
    preds = X * θ # Matrix multiplication
    loss = preds - y # Calculate the loss vector
    cost = (1/(2n)) * (loss' * loss) # Half mean squared cost
    return cost
end
Out[30]:
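As a quick sketch, we can evaluate this cost at our hand-picked parameters $\beta_{0} = 0$ and $\beta_{1} = 10$; the design matrix below is constructed just for this check, and the result should be roughly half the MSE we calculated earlier.
In [ ]:
# Design matrix with a column of ones for the intercept
X_design = hcat(ones(length(y)), X)
# Cost at θ = [0, 10]: approximately mse_model / 2
cost_function(X_design, y, [0.0, 10.0])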
In [32]:
function linear_reg_grad_d(X, y, α, fit_intercept=true, n_iter=1000)
    m = length(y) # Sample size (number of rows)
    if fit_intercept
        constant = ones(m, 1)
        X = hcat(constant, X) # Prepend a column of ones for the intercept
    end
    n = size(X)[2] # Number of features (columns)
    θ = zeros(n) # Initial parameter vector
    𝐉 = zeros(n_iter) # Cost recorded at each iteration
    for iter in range(1, stop=n_iter)
        pred = X * θ
        𝐉[iter] = cost_function(X, y, θ)
        θ = θ - ((α/m) * X') * (pred - y) # Gradient descent update
    end
    return (θ, 𝐉)
end
Out[32]:
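Calling the function returns the learned coefficients together with the per-iteration costs. The sketch below uses a learning rate of $0.01$ and the default $1000$ iterations; the parameters may not be fully converged at that point, so more iterations or a larger learning rate might be needed to approach the OLS fit further below.
In [ ]:
# Fit by gradient descent: θ_hat[1] is the intercept, θ_hat[2] the slope
θ_hat, costs = linear_reg_grad_d(X, y, 0.01)
θ_hat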
In [48]:
LinRange(1, 10, 10)
Out[48]:
In [56]:
cost_values = linear_reg_grad_d(X, y, 0.01)[2]
time_step = LinRange(1, 1000, 1000);
In [57]:
scatter(time_step, cost_values, title = "Cost per iteration", label = "cost")
Out[57]:
In [18]:
df = DataFrame(Feature = X, Target = y)
first(df, 5)
Out[18]:
In [19]:
ols = lm(@formula(Target ~ Feature), df)
Out[19]:
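The fitted coefficients can also be extracted directly with the coef function; the first element is the intercept and the second the slope, which should be close to the $10$ we used to generate the data.
In [ ]:
# Intercept and slope estimated by ordinary least squares
coef(ols)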
In [20]:
y_pred_best = predict(ols);
In [21]:
mse(y_pred_best, y)
Out[21]:
In [22]:
(12.280217763711272 - 3.8521425072619944) / 12.280217763711272
Out[22]:
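The hard-coded numbers above are the base model MSE and the best model MSE. As a sketch, the same comparison can be written with the variables we already have; for an OLS model with an intercept it should match the $r^2$ value reported below.
In [ ]:
# Variance explained by the OLS model relative to the mean-only base model
(mse_mean_model - mse(y_pred_best, y)) / mse_mean_model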
In [23]:
r2(ols)
Out[23]: